We present a blind multichannel speech enhancement method that can handle time-varying layouts of microphones and sound sources.
Nonnegative tensor factorization (NTF) separates a multichannel magnitude spectrogram into source spectrograms without using phase information, which makes it robust to time-varying mixing systems, but it usually requires prior knowledge such as spectral templates of the sources.
To overcome this limitation, we propose a Bayesian model, robust NTF (Bayesian RNTF), which decomposes a multichannel spectrogram into target speech and noise by exploiting the sparseness of speech and the low-rankness of noise.
The method is applied to speech enhancement for a hose-shaped rescue robot with distributed microphones, whose layout changes over time and where some microphones may fail.
Thanks to its formulation of a time-varying mixing system, the proposed method works robustly in such challenging conditions.
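The low-rank-plus-sparse idea can be illustrated on a single-channel magnitude spectrogram. The sketch below is not the proposed Bayesian RNTF: it is a simplified point-estimate version that alternates plain multiplicative NMF updates (for a low-rank noise part) with nonnegative soft-thresholding (for a sparse speech part). The function name and the parameters `rank`, `lam`, and `n_iter` are illustrative assumptions, not values from the paper.

```python
import numpy as np

def lowrank_sparse_decompose(X, rank=2, lam=0.1, n_iter=200, eps=1e-9):
    """Split a nonnegative magnitude spectrogram X (freq x time) into
    a low-rank part W @ H (diffuse noise) and a nonnegative sparse
    part S (target speech). Simplified sketch, not the Bayesian model."""
    rng = np.random.default_rng(0)
    F, T = X.shape
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    S = np.zeros_like(X)
    for _ in range(n_iter):
        # Part of X to be explained by the low-rank noise model.
        R = np.maximum(X - S, eps)
        # Multiplicative NMF updates (Lee-Seung, Euclidean cost).
        WH = W @ H + eps
        W *= (R @ H.T) / (WH @ H.T + eps)
        WH = W @ H + eps
        H *= (W.T @ R) / (W.T @ WH + eps)
        # Sparse part: soft-threshold the residual, kept nonnegative.
        S = np.maximum(X - W @ H - lam, 0.0)
    return W, H, S
```

In this toy form, energy that a low-rank basis explains well (stationary ego-noise) flows into `W @ H`, while isolated bursts such as speech onsets survive the thresholding and land in `S`; the Bayesian model in the paper replaces the fixed threshold with priors inferred from the data.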
Figure: Overview of our model.
Figure: Our hose-shaped rescue robot.
Speech Enhancement Based on Bayesian Low-Rank and Sparse Decomposition of Multichannel Magnitude Spectrograms
To simulate rubble that disturbs sound propagation, styrene foam boxes and wooden plates were piled up.
A loudspeaker for playing back the target speech signals was placed 2 m away from the rubble.
The target signals were four male and female speech recordings of calls for rescue in Japanese (e.g., “Ôi” (“Hey”) and “Koko ni imasu” (“I’m here”)), and the loudspeaker was calibrated so that the sound pressure level of each utterance was 80 dB.
The robot was inserted from behind the rubble and captured eight-channel audio signals (mixtures of ego-noise and the target speech) for 10 seconds during the insertion.